Problem Statement¶

Business Context¶

Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.

Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.

The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).

Objective¶

“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning and has collected data of generator failure of wind turbines using sensors. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies with companies). Data has 40 predictors, 20000 observations in the training set and 5000 in the test set.

The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model. These will result in repairing costs.
  • False negatives (FN) are real failures where there is no detection by the model. These will result in replacement costs.
  • False positives (FP) are detections where there is no failure. These will result in inspection costs.

It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.

“1” in the target variables should be considered as “failure” and “0” represents “No failure”.

Data Description¶

  • The data provided is a transformed version of original data which was collected using sensors.
  • Train.csv - To be used for training and tuning of models.
  • Test.csv - To be used only for testing the performance of the final best model.
  • Both the datasets consist of 40 predictor variables and 1 target variable

Importing necessary libraries¶

In [1]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns


# To tune model, get different metric scores and split data
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import f1_score,accuracy_score,recall_score,precision_score
from sklearn import metrics

# To build a logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

#For making pipelines
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# To suppress the warnings
import warnings
warnings.filterwarnings("ignore")

Loading the dataset¶

In [2]:
#Loading the orginal training dataset
org_data = pd.read_csv('/Users/anshamohammed/Desktop/Drive G/specialised course/Feature_eng/Project/Train.csv.csv')
org_data
Out[2]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -4.464606 -4.679129 3.101546 0.506130 -0.221083 -2.032511 -2.910870 0.050714 -1.522351 3.761892 ... 3.059700 -1.690440 2.846296 2.235198 6.667486 0.443809 -2.369169 2.950578 -3.480324 0
1 3.365912 3.653381 0.909671 -1.367528 0.332016 2.358938 0.732600 -4.332135 0.565695 -0.101080 ... -1.795474 3.032780 -2.467514 1.894599 -2.297780 -1.731048 5.908837 -0.386345 0.616242 0
2 -3.831843 -5.824444 0.634031 -2.418815 -1.773827 1.016824 -2.098941 -3.173204 -2.081860 5.392621 ... -0.257101 0.803550 4.086219 2.292138 5.360850 0.351993 2.940021 3.839160 -4.309402 0
3 1.618098 1.888342 7.046143 -1.147285 0.083080 -1.529780 0.207309 -2.493629 0.344926 2.118578 ... -3.584425 -2.577474 1.363769 0.622714 5.550100 -1.526796 0.138853 3.101430 -1.277378 0
4 -0.111440 3.872488 -3.758361 -2.982897 3.792714 0.544960 0.205433 4.848994 -1.854920 -6.220023 ... 8.265896 6.629213 -10.068689 1.222987 -3.229763 1.686909 -2.163896 -3.644622 6.510338 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19995 -2.071318 -1.088279 -0.796174 -3.011720 -2.287540 2.807310 0.481428 0.105171 -0.586599 -2.899398 ... -8.273996 5.745013 0.589014 -0.649988 -3.043174 2.216461 0.608723 0.178193 2.927755 1
19996 2.890264 2.483069 5.643919 0.937053 -1.380870 0.412051 -1.593386 -5.762498 2.150096 0.272302 ... -4.159092 1.181466 -0.742412 5.368979 -0.693028 -1.668971 3.659954 0.819863 -1.987265 0
19997 -3.896979 -3.942407 -0.351364 -2.417462 1.107546 -1.527623 -3.519882 2.054792 -0.233996 -0.357687 ... 7.112162 1.476080 -3.953710 1.855555 5.029209 2.082588 -6.409304 1.477138 -0.874148 0
19998 -3.187322 -10.051662 5.695955 -4.370053 -5.354758 -1.873044 -3.947210 0.679420 -2.389254 5.456756 ... 0.402812 3.163661 3.752095 8.529894 8.450626 0.203958 -7.129918 4.249394 -6.112267 0
19999 -2.686903 1.961187 6.137088 2.600133 2.657241 -4.290882 -2.344267 0.974004 -1.027462 0.497421 ... 6.620811 -1.988786 -1.348901 3.951801 5.449706 -0.455411 -2.202056 1.678229 -1.974413 0

20000 rows × 41 columns

In [3]:
#Making a copy
df = org_data.copy()
org_data
Out[3]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -4.464606 -4.679129 3.101546 0.506130 -0.221083 -2.032511 -2.910870 0.050714 -1.522351 3.761892 ... 3.059700 -1.690440 2.846296 2.235198 6.667486 0.443809 -2.369169 2.950578 -3.480324 0
1 3.365912 3.653381 0.909671 -1.367528 0.332016 2.358938 0.732600 -4.332135 0.565695 -0.101080 ... -1.795474 3.032780 -2.467514 1.894599 -2.297780 -1.731048 5.908837 -0.386345 0.616242 0
2 -3.831843 -5.824444 0.634031 -2.418815 -1.773827 1.016824 -2.098941 -3.173204 -2.081860 5.392621 ... -0.257101 0.803550 4.086219 2.292138 5.360850 0.351993 2.940021 3.839160 -4.309402 0
3 1.618098 1.888342 7.046143 -1.147285 0.083080 -1.529780 0.207309 -2.493629 0.344926 2.118578 ... -3.584425 -2.577474 1.363769 0.622714 5.550100 -1.526796 0.138853 3.101430 -1.277378 0
4 -0.111440 3.872488 -3.758361 -2.982897 3.792714 0.544960 0.205433 4.848994 -1.854920 -6.220023 ... 8.265896 6.629213 -10.068689 1.222987 -3.229763 1.686909 -2.163896 -3.644622 6.510338 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19995 -2.071318 -1.088279 -0.796174 -3.011720 -2.287540 2.807310 0.481428 0.105171 -0.586599 -2.899398 ... -8.273996 5.745013 0.589014 -0.649988 -3.043174 2.216461 0.608723 0.178193 2.927755 1
19996 2.890264 2.483069 5.643919 0.937053 -1.380870 0.412051 -1.593386 -5.762498 2.150096 0.272302 ... -4.159092 1.181466 -0.742412 5.368979 -0.693028 -1.668971 3.659954 0.819863 -1.987265 0
19997 -3.896979 -3.942407 -0.351364 -2.417462 1.107546 -1.527623 -3.519882 2.054792 -0.233996 -0.357687 ... 7.112162 1.476080 -3.953710 1.855555 5.029209 2.082588 -6.409304 1.477138 -0.874148 0
19998 -3.187322 -10.051662 5.695955 -4.370053 -5.354758 -1.873044 -3.947210 0.679420 -2.389254 5.456756 ... 0.402812 3.163661 3.752095 8.529894 8.450626 0.203958 -7.129918 4.249394 -6.112267 0
19999 -2.686903 1.961187 6.137088 2.600133 2.657241 -4.290882 -2.344267 0.974004 -1.027462 0.497421 ... 6.620811 -1.988786 -1.348901 3.951801 5.449706 -0.455411 -2.202056 1.678229 -1.974413 0

20000 rows × 41 columns

Data Overview¶

  • Observations
  • Sanity checks
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      19982 non-null  float64
 1   V2      19982 non-null  float64
 2   V3      20000 non-null  float64
 3   V4      20000 non-null  float64
 4   V5      20000 non-null  float64
 5   V6      20000 non-null  float64
 6   V7      20000 non-null  float64
 7   V8      20000 non-null  float64
 8   V9      20000 non-null  float64
 9   V10     20000 non-null  float64
 10  V11     20000 non-null  float64
 11  V12     20000 non-null  float64
 12  V13     20000 non-null  float64
 13  V14     20000 non-null  float64
 14  V15     20000 non-null  float64
 15  V16     20000 non-null  float64
 16  V17     20000 non-null  float64
 17  V18     20000 non-null  float64
 18  V19     20000 non-null  float64
 19  V20     20000 non-null  float64
 20  V21     20000 non-null  float64
 21  V22     20000 non-null  float64
 22  V23     20000 non-null  float64
 23  V24     20000 non-null  float64
 24  V25     20000 non-null  float64
 25  V26     20000 non-null  float64
 26  V27     20000 non-null  float64
 27  V28     20000 non-null  float64
 28  V29     20000 non-null  float64
 29  V30     20000 non-null  float64
 30  V31     20000 non-null  float64
 31  V32     20000 non-null  float64
 32  V33     20000 non-null  float64
 33  V34     20000 non-null  float64
 34  V35     20000 non-null  float64
 35  V36     20000 non-null  float64
 36  V37     20000 non-null  float64
 37  V38     20000 non-null  float64
 38  V39     20000 non-null  float64
 39  V40     20000 non-null  float64
 40  Target  20000 non-null  int64  
dtypes: float64(40), int64(1)
memory usage: 6.3 MB
  • Some columns have missing values and need to be imputed
  • All the variables are float data type
In [5]:
df.describe().T
Out[5]:
count mean std min 25% 50% 75% max
V1 19982.0 -0.271996 3.441625 -11.876451 -2.737146 -0.747917 1.840112 15.493002
V2 19982.0 0.440430 3.150784 -12.319951 -1.640674 0.471536 2.543967 13.089269
V3 20000.0 2.484699 3.388963 -10.708139 0.206860 2.255786 4.566165 17.090919
V4 20000.0 -0.083152 3.431595 -15.082052 -2.347660 -0.135241 2.130615 13.236381
V5 20000.0 -0.053752 2.104801 -8.603361 -1.535607 -0.101952 1.340480 8.133797
V6 20000.0 -0.995443 2.040970 -10.227147 -2.347238 -1.000515 0.380330 6.975847
V7 20000.0 -0.879325 1.761626 -7.949681 -2.030926 -0.917179 0.223695 8.006091
V8 20000.0 -0.548195 3.295756 -15.657561 -2.642665 -0.389085 1.722965 11.679495
V9 20000.0 -0.016808 2.160568 -8.596313 -1.494973 -0.067597 1.409203 8.137580
V10 20000.0 -0.012998 2.193201 -9.853957 -1.411212 0.100973 1.477045 8.108472
V11 20000.0 -1.895393 3.124322 -14.832058 -3.922404 -1.921237 0.118906 11.826433
V12 20000.0 1.604825 2.930454 -12.948007 -0.396514 1.507841 3.571454 15.080698
V13 20000.0 1.580486 2.874658 -13.228247 -0.223545 1.637185 3.459886 15.419616
V14 20000.0 -0.950632 1.789651 -7.738593 -2.170741 -0.957163 0.270677 5.670664
V15 20000.0 -2.414993 3.354974 -16.416606 -4.415322 -2.382617 -0.359052 12.246455
V16 20000.0 -2.925225 4.221717 -20.374158 -5.634240 -2.682705 -0.095046 13.583212
V17 20000.0 -0.134261 3.345462 -14.091184 -2.215611 -0.014580 2.068751 16.756432
V18 20000.0 1.189347 2.592276 -11.643994 -0.403917 0.883398 2.571770 13.179863
V19 20000.0 1.181808 3.396925 -13.491784 -1.050168 1.279061 3.493299 13.237742
V20 20000.0 0.023608 3.669477 -13.922659 -2.432953 0.033415 2.512372 16.052339
V21 20000.0 -3.611252 3.567690 -17.956231 -5.930360 -3.532888 -1.265884 13.840473
V22 20000.0 0.951835 1.651547 -10.122095 -0.118127 0.974687 2.025594 7.409856
V23 20000.0 -0.366116 4.031860 -14.866128 -3.098756 -0.262093 2.451750 14.458734
V24 20000.0 1.134389 3.912069 -16.387147 -1.468062 0.969048 3.545975 17.163291
V25 20000.0 -0.002186 2.016740 -8.228266 -1.365178 0.025050 1.397112 8.223389
V26 20000.0 1.873785 3.435137 -11.834271 -0.337863 1.950531 4.130037 16.836410
V27 20000.0 -0.612413 4.368847 -14.904939 -3.652323 -0.884894 2.189177 17.560404
V28 20000.0 -0.883218 1.917713 -9.269489 -2.171218 -0.891073 0.375884 6.527643
V29 20000.0 -0.985625 2.684365 -12.579469 -2.787443 -1.176181 0.629773 10.722055
V30 20000.0 -0.015534 3.005258 -14.796047 -1.867114 0.184346 2.036229 12.505812
V31 20000.0 0.486842 3.461384 -13.722760 -1.817772 0.490304 2.730688 17.255090
V32 20000.0 0.303799 5.500400 -19.876502 -3.420469 0.052073 3.761722 23.633187
V33 20000.0 0.049825 3.575285 -16.898353 -2.242857 -0.066249 2.255134 16.692486
V34 20000.0 -0.462702 3.183841 -17.985094 -2.136984 -0.255008 1.436935 14.358213
V35 20000.0 2.229620 2.937102 -15.349803 0.336191 2.098633 4.064358 15.291065
V36 20000.0 1.514809 3.800860 -14.833178 -0.943809 1.566526 3.983939 19.329576
V37 20000.0 0.011316 1.788165 -5.478350 -1.255819 -0.128435 1.175533 7.467006
V38 20000.0 -0.344025 3.948147 -17.375002 -2.987638 -0.316849 2.279399 15.289923
V39 20000.0 0.890653 1.753054 -6.438880 -0.272250 0.919261 2.057540 7.759877
V40 20000.0 -0.875630 3.012155 -11.023935 -2.940193 -0.920806 1.119897 10.654265
Target 20000.0 0.055500 0.228959 0.000000 0.000000 0.000000 0.000000 1.000000
In [174]:
df.Target.value_counts()
Out[174]:
0    18890
1     1110
Name: Target, dtype: int64
In [172]:
df.Target.value_counts().values[0]/df.Target.value_counts().values.sum()
Out[172]:
0.9445
In [173]:
df.Target.value_counts().values[1]/df.Target.value_counts().values.sum()
Out[173]:
0.0555
  • This dataset is heavily imbalanced. 94.45% data has 0th category value and only 5.55% data has 1th category.
In [6]:
df.isnull().sum()[df.isnull().sum() > 0]
Out[6]:
V1    18
V2    18
dtype: int64

Exploratory Data Analysis (EDA)¶

Plotting histograms and boxplots for all the variables¶

In [176]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, hue = 'Target'
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

Plotting all the features at one go¶

In [178]:
for feature in df.columns:
    histogram_boxplot(df, feature, figsize=(12, 7), kde=False, bins=None) ## Please change the dataframe name as you define while reading the data
In [21]:
plt.figure(figsize=(25,20))
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
Out[21]:
<AxesSubplot: >

Data Pre-processing¶

Cheaking duplicates

In [8]:
df.duplicated().sum()
Out[8]:
0

No duplicate values

Missing value imputation¶

In [9]:
df.isnull().sum()[df.isnull().sum() > 0]
Out[9]:
V1    18
V2    18
dtype: int64
  • There are 18 missing values which is imputed using mean and median for 2 different variable
In [10]:
imputation_dict = {'V1': df.V1.median(), 'V2' : df.V2.mean()}
df.fillna(imputation_dict, inplace= True)
In [11]:
df.isnull().sum()
Out[11]:
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
V29       0
V30       0
V31       0
V32       0
V33       0
V34       0
V35       0
V36       0
V37       0
V38       0
V39       0
V40       0
Target    0
dtype: int64

Removing corelated values

In [12]:
correlation = df.corr()
In [13]:
#Finding the correlated features
highly_correlated_cols = set()
for i in range(len(correlation.columns)):
    for j in range(i):
        if abs(correlation.iloc[i, j]) > 0.8:
            colname = correlation.columns[i]
            highly_correlated_cols.add(colname)
highly_correlated_cols       
Out[13]:
{'V14', 'V15', 'V16', 'V21', 'V29', 'V32'}
In [14]:
#Droping the corelated features
df.drop(highly_correlated_cols, axis = 1, inplace= True)
In [15]:
#again checking the correlation
plt.figure(figsize=(25,20))
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
Out[15]:
<AxesSubplot: >

outlier treatment

In [179]:
#creating a function for outlier treatment. Here using vinserisation technique
def outlier_tretment(df,column):
    cap = df[column].quantile(.75) + (1.5 * (df[column].quantile(.75) - df[column].quantile(.25)))
    floor = df[column].quantile(.25)  - (1.5 * (df[column].quantile(.75) - df[column].quantile(.25)))
    df.loc[df[column] >= cap,column] = cap
    df.loc[df[column] <= floor,column] = floor
    print('{} number ourliers are floored to {} and {} number ourliers are capped to {} for {}'.format(sum(df[column] >= cap),cap,sum(df[column] <= floor),floor,column))
    return df
In [180]:
df.columns.drop('Target')
Out[180]:
Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V17', 'V18', 'V19', 'V20', 'V22', 'V23', 'V24', 'V25',
       'V26', 'V27', 'V28', 'V30', 'V31', 'V33', 'V34', 'V35', 'V36', 'V37',
       'V38', 'V39', 'V40'],
      dtype='object')
In [181]:
#Doing the Outlier treatment for all columns
for i in df.columns.drop('Target'):
    df = outlier_tretment(df,i)
df
204 number ourliers are floored to 8.697038993 and 10 number ourliers are capped to -9.595468691
85 number ourliers are floored to 8.812472241125 and 98 number ourliers are capped to -7.907372703875001
215 number ourliers are floored to 11.105121678 and 60 number ourliers are capped to -6.332096354
139 number ourliers are floored to 8.848026312624999 and 89 number ourliers are capped to -9.065071370375
81 number ourliers are floored to 5.654610273874999 and 32 number ourliers are capped to -5.849736979125
65 number ourliers are floored to 4.471681297875 and 90 number ourliers are capped to -6.4385896071249995
203 number ourliers are floored to 3.6056262381250006 and 88 number ourliers are capped to -5.412857980875001
32 number ourliers are floored to 8.271409967 and 159 number ourliers are capped to -9.191109786999998
93 number ourliers are floored to 5.76546797325 and 55 number ourliers are capped to -5.85123781075
53 number ourliers are floored to 5.809430781 and 161 number ourliers are capped to -5.743597170999999
136 number ourliers are floored to 6.180870850124999 and 122 number ourliers are capped to -9.984368858875
82 number ourliers are floored to 9.523405883875 and 66 number ourliers are capped to -6.348465309124999
108 number ourliers are floored to 8.985032979875 and 195 number ourliers are capped to -5.748692261125
99 number ourliers are floored to 8.495293339749999 and 197 number ourliers are capped to -8.642152822249999
480 number ourliers are floored to 7.035300772125 and 251 number ourliers are capped to -4.8674477688749995
42 number ourliers are floored to 10.308498732124999 and 107 number ourliers are capped to -7.865368428875
66 number ourliers are floored to 9.930359781625 and 87 number ourliers are capped to -9.850940157375
89 number ourliers are floored to 5.24117474725 and 156 number ourliers are capped to -3.33370782275
35 number ourliers are floored to 10.777507935374999 and 59 number ourliers are capped to -11.424514369625
221 number ourliers are floored to 11.067031360749999 and 86 number ourliers are capped to -8.989118585249999
40 number ourliers are floored to 5.54054729775 and 74 number ourliers are capped to -5.50861358825
99 number ourliers are floored to 10.83188680575 and 143 number ourliers are capped to -7.039712204250001
155 number ourliers are floored to 10.951426229875 and 23 number ourliers are capped to -12.414572493125
106 number ourliers are floored to 4.196537342625 and 75 number ourliers are capped to -5.991870726375
34 number ourliers are floored to 7.891242076125001 and 212 number ourliers are capped to -7.722126908875
118 number ourliers are floored to 9.553376783500001 and 111 number ourliers are capped to -8.6404610585
225 number ourliers are floored to 9.002119060375 and 158 number ourliers are capped to -8.989842008624999
249 number ourliers are floored to 6.7978123625 and 554 number ourliers are capped to -7.497861389500001
182 number ourliers are floored to 9.656608620000002 and 133 number ourliers are capped to -5.256060092000001
127 number ourliers are floored to 11.37556147175 and 134 number ourliers are capped to -8.33543197025
134 number ourliers are floored to 4.8225600135 and 6 number ourliers are capped to -4.9028459125
81 number ourliers are floored to 10.179954516625 and 84 number ourliers are capped to -10.888193428375
85 number ourliers are floored to 5.55222571925 and 110 number ourliers are capped to -3.76693592675
91 number ourliers are floored to 7.2100333027499985 and 46 number ourliers are capped to -9.030329529249999
Out[181]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V31 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -4.464606 -4.679129 3.101546 0.506130 -0.221083 -2.032511 -2.910870 0.050714 -1.522351 3.761892 ... 1.667098 -1.690440 2.846296 2.235198 6.667486 0.443809 -2.369169 2.950578 -3.480324 0
1 3.365912 3.653381 0.909671 -1.367528 0.332016 2.358938 0.732600 -4.332135 0.565695 -0.101080 ... 0.024883 3.032780 -2.467514 1.894599 -2.297780 -1.731048 5.908837 -0.386345 0.616242 0
2 -3.831843 -5.824444 0.634031 -2.418815 -1.773827 1.016824 -2.098941 -3.173204 -2.081860 5.392621 ... -1.600395 0.803550 4.086219 2.292138 5.360850 0.351993 2.940021 3.839160 -4.309402 0
3 1.618098 1.888342 7.046143 -1.147285 0.083080 -1.529780 0.207309 -2.493629 0.344926 2.118578 ... 4.948770 -2.577474 1.363769 0.622714 5.550100 -1.526796 0.138853 3.101430 -1.277378 0
4 -0.111440 3.872488 -3.758361 -2.982897 3.792714 0.544960 0.205433 4.848994 -1.854920 -5.743597 ... 2.044184 6.629213 -7.497861 1.222987 -3.229763 1.686909 -2.163896 -3.644622 6.510338 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19995 -2.071318 -1.088279 -0.796174 -3.011720 -2.287540 2.807310 0.481428 0.105171 -0.586599 -2.899398 ... -3.938493 5.745013 0.589014 -0.649988 -3.043174 2.216461 0.608723 0.178193 2.927755 1
19996 2.890264 2.483069 5.643919 0.937053 -1.380870 0.412051 -1.593386 -5.762498 2.150096 0.272302 ... -1.088553 1.181466 -0.742412 5.368979 -0.693028 -1.668971 3.659954 0.819863 -1.987265 0
19997 -3.896979 -3.942407 -0.351364 -2.417462 1.107546 -1.527623 -3.519882 2.054792 -0.233996 -0.357687 ... 0.981858 1.476080 -3.953710 1.855555 5.029209 2.082588 -6.409304 1.477138 -0.874148 0
19998 -3.187322 -7.907373 5.695955 -4.370053 -5.354758 -1.873044 -3.947210 0.679420 -2.389254 5.456756 ... 1.914766 3.163661 3.752095 8.529894 8.450626 0.203958 -7.129918 4.249394 -6.112267 0
19999 -2.686903 1.961187 6.137088 2.600133 2.657241 -4.290882 -2.344267 0.974004 -1.027462 0.497421 ... 4.674280 -1.988786 -1.348901 3.951801 5.449706 -0.455411 -2.202056 1.678229 -1.974413 0

20000 rows × 35 columns

In [182]:
# outlier detection using boxplot


plt.figure(figsize=(15, 10))

for i, variable in enumerate(df.columns.drop('Target')):
    plt.subplot(7, 5, i + 1)
    sns.boxplot(data=df, x=variable)
    plt.tight_layout(pad=2)

plt.show()

All outliers are removed

In [183]:
plt.figure(figsize=(15, 10))

for i, variable in enumerate(df.columns.drop('Target')):
    plt.subplot(7, 5, i + 1)
    sns.histplot(data=df, x=variable)
    plt.tight_layout(pad=2)

plt.show()

Splitting data in to train validation and test

In [17]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('Target',axis = 1), df['Target'], train_size= .85, random_state= 10)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size= .8, random_state= 10)
In [18]:
X_train.shape
Out[18]:
(13600, 34)
In [110]:
X_val.shape
Out[110]:
(3400, 34)
In [111]:
X_test.shape
Out[111]:
(3000, 34)

Model Building¶

Model evaluation criterion¶

The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model.
  • False negatives (FN) are real failures in a generator where there is no detection by model.
  • False positives (FP) are failure detections in a generator where there is no failure.

Which metric to optimize?

  • We need to choose the metric which will ensure that the maximum number of generator failures are predicted correctly by the model.
  • We would want Recall to be maximized as greater the Recall, the higher the chances of minimizing false negatives.
  • We want to minimize false negatives because if a model predicts that a machine will have no failure when there will be a failure, it will increase the maintenance cost.

Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.

In [119]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1
            
        },
        index=[0],
    )

    return df_perf

Defining scorer to be used for cross-validation and hyperparameter tuning¶

  • We want to reduce false negatives and will try to maximize "Recall".
  • To maximize Recall, we can use Recall as a scorer in cross-validation and hyperparameter tuning.
In [19]:
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

Model Building with original data¶

Selecting 6 models for comparison and appending to a list

In [ ]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
models.append(("logreg", LogisticRegression(random_state=1)))
models.append(("Randomforest", RandomForestClassifier(random_state=1)))
models.append(("Bagging_class", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1))))
models.append(("Adaboost_class", AdaBoostClassifier(base_estimator=DecisionTreeClassifier(random_state=1))))
models.append(("SVM", SVC(kernel='linear', random_state=1)))

Checking for the 6 models with given dataset

In [31]:
results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 10
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

dtree: 0.7236842105263157
logreg: 0.5210526315789473
Randomforest: 0.7092105263157895
Bagging_class: 0.7105263157894738
Adaboost_class: 0.7263157894736842
SVM: 0.48552631578947364

Validation Performance:

dtree: 0.6720430107526881
logreg: 0.44623655913978494
Randomforest: 0.6612903225806451
Bagging_class: 0.6182795698924731
Adaboost_class: 0.6559139784946236
SVM: 0.43010752688172044

Observations¶

  • The best 3 model seems to be Decision tree, Random forst and Adaboost in means of Recall value
  • Dtree has a recall value of 0.6720
  • Random forest has a recall value of 0.6612
  • Adaboost has a recall value of 0.6559

Model Building with Oversampled data¶

In [32]:
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=10, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
In [34]:
results2 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models

# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
    )
    results2.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

dtree: 0.9611370716510905
logreg: 0.8812305295950156
Randomforest: 0.9758566978193148
Bagging_class: 0.9658878504672896
Adaboost_class: 0.9607476635514018
SVM: 0.8801401869158878

Validation Performance:

dtree: 0.7580645161290323
logreg: 0.8387096774193549
Randomforest: 0.8279569892473119
Bagging_class: 0.7741935483870968
Adaboost_class: 0.7473118279569892
SVM: 0.8333333333333334

Observations¶

  • The best 3 model seems to be Decision Logistic regression, SVM and Random Forest in means of Recall value
  • Logistic Regression has a recall value of 0.8387
  • Random forest has a recall value of 0.8279
  • SVM has a recall value of 0.833

Model Building with Undersampled data¶

In [35]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
In [36]:
results3 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
    )
    results3.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

dtree: 0.8486842105263157
logreg: 0.8631578947368421
Randomforest: 0.8934210526315789
Bagging_class: 0.8578947368421053
Adaboost_class: 0.844736842105263
SVM: 0.8605263157894736

Validation Performance:

dtree: 0.8225806451612904
logreg: 0.8387096774193549
Randomforest: 0.8655913978494624
Bagging_class: 0.8333333333333334
Adaboost_class: 0.8279569892473119
SVM: 0.8333333333333334

Observations¶

  • The best 3 model seems to be Logistic regression, Random forst and Adaboost in means of Recall value
  • Logistic regression has a recall value of 08387
  • Random forest has a recall value of 0.8655
  • Bagging class has a recall value of 0.833

Models with Unersampled data and Over sampled data performed better than Orginal dataset. This is expected because the dataste was highly imbalanced¶

HyperparameterTuning¶

Sample Parameter Grids¶

Hyperparameter tuning can take a long time to run, so to avoid that time complexity - you can use the following grids, wherever required.

  • For Gradient Boosting:

param_grid = { "n_estimators": np.arange(100,150,25), "learning_rate": [0.2, 0.05, 1], "subsample":[0.5,0.7], "max_features":[0.5,0.7] }

  • For Adaboost:

param_grid = { "n_estimators": [100, 150, 200], "learning_rate": [0.2, 0.05], "base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1), ] }

  • For Bagging Classifier:

param_grid = { 'max_samples': [0.8,0.9,1], 'max_features': [0.7,0.8,0.9], 'n_estimators' : [30,50,70], }

  • For Random Forest:

param_grid = { "n_estimators": [200,250,300], "min_samples_leaf": np.arange(1, 4), "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], "max_samples": np.arange(0.4, 0.7, 0.1) }

  • For Decision Trees:

param_grid = { 'max_depth': np.arange(2,6), 'min_samples_leaf': [1, 4, 7], 'max_leaf_nodes' : [10, 15], 'min_impurity_decrease': [0.0001,0.001] }

  • For Logistic Regression:

param_grid = {'C': np.arange(0.1,1.1,0.1)}

  • For XGBoost:

param_grid={ 'n_estimators': [150, 200, 250], 'scale_pos_weight': [5,10], 'learning_rate': [0.1,0.2], 'gamma': [0,3,5], 'subsample': [0.8,0.9] }

Sample tuning method for Decision tree with original data¶

Choosing the parameter grid for the selected models

In [37]:
param_grid_1 = {
    'max_depth': np.arange(2,6), 
    'min_samples_leaf': [1, 4, 7],
    'max_leaf_nodes' : [10, 15],
    'min_impurity_decrease': [0.0001,0.001]
}

param_grid_2 = {'C': np.arange(0.1,1.1,0.1)}

param_grid_3 = {
    "n_estimators": [200,250,300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1)
}

param_grid_4 = {
    'max_samples': [0.8,0.9,1], 
    'max_features': [0.7,0.8,0.9],
    'n_estimators' : [30,50,70],
}

param_grid_5 = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.2, 0.05],
    "base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1),
    ]
}

param_grid_6 = {
    'C': [0.1, 1, 10],          # Regularization parameter
    'kernel': ['linear', 'rbf', 'poly'],  # Kernel type
    'degree': [2, 3, 4],        # Degree of the polynomial kernel (only for 'poly' kernel)
    'gamma': ['scale', 'auto'], # Kernel coefficient (only for 'rbf' and 'poly' kernels)
}


param_grid = [param_grid_1,param_grid_2,param_grid_3,param_grid_4,param_grid_5,param_grid_6]

Fine tuning the best three model selected using orginal dataset

In [56]:
models_rscv_1 = []  # Empty list to store all the models

# Appending models into the list
models_rscv_1.append((param_grid_1,"dtree", DecisionTreeClassifier(random_state=1)))
models_rscv_1.append((param_grid_3,"Randomforest", RandomForestClassifier(random_state=1)))
models_rscv_1.append((param_grid_5,"Adaboost_class", AdaBoostClassifier(base_estimator=DecisionTreeClassifier(random_state=1))))
In [58]:
best_model_normal = []
for param, name, model in models_rscv_1:
#Calling RandomizedSearchCV
    randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

    #Fitting parameters in RandomizedSearchCV
    randomized_cv.fit(X_train,y_train)

    best_params_1 = randomized_cv.best_params_
    best_model_1 = randomized_cv.best_estimator_
    best_model_normal.append((name,best_model_1))

    model_check = best_model_1.fit(X_train,y_train)
    scores = recall_score(y_val, model_check.predict(X_val))
    
    print(" The {} model's Best parameters are {} with CV score={}:" .format(name, randomized_cv.best_params_,randomized_cv.best_score_))
    print("Validation score of {}: {}".format(name, scores))
 The dtree model's Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.525:
Validation score of dtree: 0.489247311827957
 The Randomforest model's Best parameters are {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.6881578947368421:
Validation score of Randomforest: 0.6397849462365591
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
 The Adaboost_class model's Best parameters are {'n_estimators': 200, 'learning_rate': 0.2, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.7592105263157894:
Validation score of Adaboost_class: 0.7096774193548387

Sample tuning method for Decision tree with oversampled data¶

In [63]:
models_rscv_2 = []  # Empty list to store all the models

# Appending models into the list
models_rscv_2.append((param_grid_3,"Randomforest", RandomForestClassifier(random_state=1)))
models_rscv_2.append((param_grid_2,"logreg", LogisticRegression(random_state=1)))
models_rscv_2.append((param_grid_6,"SVM", SVC(random_state=1)))

Fine tuning the best three model selected using Oversampled dataset

In [64]:
best_model_over = []
for param, name, model in models_rscv_2:
#Calling RandomizedSearchCV
    randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

    #Fitting parameters in RandomizedSearchCV
    randomized_cv.fit(X_train,y_train)

    best_params_1 = randomized_cv.best_params_
    best_model_1 = randomized_cv.best_estimator_
    best_model_over.append((name,best_model_1))

    model_check = best_model_1.fit(X_train,y_train)
    scores = recall_score(y_val, model_check.predict(X_val))
    
    print(" The {} model's Best parameters are {} with CV score={}:" .format(name, randomized_cv.best_params_,randomized_cv.best_score_))
    print("Validation score of {}: {}".format(name, scores))
 The Randomforest model's Best parameters are {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.6881578947368421:
Validation score of Randomforest: 0.6397849462365591
 The logreg model's Best parameters are {'C': 0.1} with CV score=0.5131578947368421:
Validation score of logreg: 0.44623655913978494
 The SVM model's Best parameters are {'kernel': 'rbf', 'gamma': 'scale', 'degree': 4, 'C': 10} with CV score=0.8605263157894736:
Validation score of SVM: 0.8225806451612904

Sample tuning method for Decision tree with undersampled data¶

In [41]:
models_rscv_3 = []  # Empty list to store all the models

# Appending models into the list
models_rscv_3.append((param_grid_2,"logreg", LogisticRegression(random_state=1)))
models_rscv_3.append((param_grid_3,"Randomforest", RandomForestClassifier(random_state=1)))
models_rscv_3.append((param_grid_4,"Bagging_class", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1))))

Fine tuning the best three model selected using Undersampled dataset

In [54]:
best_model_under = []
for param, name, model in models_rscv_3:
#Calling RandomizedSearchCV
    randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

    #Fitting parameters in RandomizedSearchCV
    randomized_cv.fit(X_train_un,y_train_un)

    best_params_1 = randomized_cv.best_params_
    best_model_1 = randomized_cv.best_estimator_
    best_model_under.append((name,best_model_1))
    
    model_check = best_model_1.fit(X_train_un,y_train_un)
    scores = recall_score(y_val, model_check.predict(X_val))
    
    print(" The {} model's Best parameters are {} with CV score={}:" .format(name, randomized_cv.best_params_,randomized_cv.best_score_))
    print("Validation score of {}: {}".format(name, scores))
 The logreg model's Best parameters are {'C': 0.30000000000000004} with CV score=0.868421052631579:
Validation score of logreg: 0.8387096774193549
 The Randomforest model's Best parameters are {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.8960526315789472:
Validation score of Randomforest: 0.8655913978494624
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
/opt/homebrew/lib/python3.10/site-packages/sklearn/ensemble/_base.py:166: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
 The Bagging_class model's Best parameters are {'n_estimators': 70, 'max_samples': 0.8, 'max_features': 0.8} with CV score=0.8907894736842106:
Validation score of Bagging_class: 0.8548387096774194

Model performance comparison and choosing the final model¶

In [68]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(models, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    
    
    accuracy_list = []
    recall_list = []
    precision_list = []
    f1_score_list = []
    index_list = []

    for i in range (len(models)):
        pred = models[i].predict(predictors)
        acc = accuracy_score(target, pred)  # to compute Accuracy
        accuracy_list.append(acc)
    
        recall = recall_score(target, pred)  # to compute Recall
        recall_list.append(recall)
    
        precision = precision_score(target, pred)  # to compute Precision
        precision_list.append(precision)
    
        f1 = f1_score(target, pred)  # to compute F1-score
        f1_score_list.append(f1)
        
        model_name = 'Model_' + str(i)
        index_list.append(model_name)
    

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": accuracy_list,
            "Recall": recall_list,
            "Precision": precision_list,
            "F1": f1_score_list
            
        },
        index= index_list,
    )

    return df_perf
In [65]:
all_9_models = best_model_normal + best_model_over + best_model_under 
In [71]:
all_9_models[0][1]
Out[71]:
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=15,
                       min_impurity_decrease=0.0001, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=15,
                       min_impurity_decrease=0.0001, random_state=1)
In [72]:
all_9_models_only = [i[1] for i in all_9_models]
In [73]:
all_9_models_only
Out[73]:
[DecisionTreeClassifier(max_depth=5, max_leaf_nodes=15,
                        min_impurity_decrease=0.0001, random_state=1),
 RandomForestClassifier(max_samples=0.6, n_estimators=300, random_state=1),
 AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                          random_state=1),
                    learning_rate=0.2, n_estimators=200),
 RandomForestClassifier(max_samples=0.6, n_estimators=300, random_state=1),
 LogisticRegression(C=0.1, random_state=1),
 SVC(C=10, degree=4, random_state=1),
 LogisticRegression(C=0.30000000000000004, random_state=1),
 RandomForestClassifier(max_samples=0.6, n_estimators=300, random_state=1),
 BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1),
                   max_features=0.8, max_samples=0.8, n_estimators=70)]
In [74]:
model_performance_classification_sklearn(all_9_models_only, X_train, y_train)
Out[74]:
Accuracy Recall Precision F1
Model_0 0.974779 0.590789 0.933472 0.723610
Model_1 0.994706 0.907895 0.997110 0.950413
Model_2 0.999412 0.989474 1.000000 0.994709
Model_3 0.994706 0.907895 0.997110 0.950413
Model_4 0.968750 0.517105 0.871397 0.649050
Model_5 0.993971 0.893421 0.998529 0.943056
Model_6 0.865515 0.865789 0.275891 0.418442
Model_7 0.947574 0.976316 0.516354 0.675467
Model_8 0.948603 0.994737 0.521020 0.683853
In [75]:
model_performance_classification_sklearn(all_9_models_only, X_val, y_val)
Out[75]:
Accuracy Recall Precision F1
Model_0 0.965588 0.489247 0.805310 0.608696
Model_1 0.979118 0.639785 0.967480 0.770227
Model_2 0.981176 0.709677 0.929577 0.804878
Model_3 0.979118 0.639785 0.967480 0.770227
Model_4 0.963529 0.446237 0.798077 0.572414
Model_5 0.989412 0.822581 0.980769 0.894737
Model_6 0.858529 0.838710 0.257002 0.393443
Model_7 0.936471 0.865591 0.457386 0.598513
Model_8 0.933824 0.854839 0.445378 0.585635
In [76]:
model_performance_classification_sklearn(all_9_models_only, X_test, y_test)
Out[76]:
Accuracy Recall Precision F1
Model_0 0.965667 0.475610 0.821053 0.602317
Model_1 0.981000 0.658537 0.990826 0.791209
Model_2 0.986333 0.756098 0.992000 0.858131
Model_3 0.981000 0.658537 0.990826 0.791209
Model_4 0.966000 0.463415 0.844444 0.598425
Model_5 0.991000 0.841463 0.992806 0.910891
Model_6 0.866667 0.829268 0.267717 0.404762
Model_7 0.945333 0.896341 0.500000 0.641921
Model_8 0.944333 0.890244 0.494915 0.636166

Test set final performance¶

In [144]:
df_test = pd.read_csv('/Users/anshamohammed/Desktop/Drive G/specialised course/Feature_eng/Project/Test.csv.csv')
df_test
Out[144]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -0.613489 -3.819640 2.202302 1.300420 -1.184929 -4.495964 -1.835817 4.722989 1.206140 -0.341909 ... 2.291204 -5.411388 0.870073 0.574479 4.157191 1.428093 -10.511342 0.454664 -1.448363 0
1 0.389608 -0.512341 0.527053 -2.576776 -1.016766 2.235112 -0.441301 -4.405744 -0.332869 1.966794 ... -2.474936 2.493582 0.315165 2.059288 0.683859 -0.485452 5.128350 1.720744 -1.488235 0
2 -0.874861 -0.640632 4.084202 -1.590454 0.525855 -1.957592 -0.695367 1.347309 -1.732348 0.466500 ... -1.318888 -2.997464 0.459664 0.619774 5.631504 1.323512 -1.752154 1.808302 1.675748 0
3 0.238384 1.458607 4.014528 2.534478 1.196987 -3.117330 -0.924035 0.269493 1.322436 0.702345 ... 3.517918 -3.074085 -0.284220 0.954576 3.029331 -1.367198 -3.412140 0.906000 -2.450889 0
4 5.828225 2.768260 -1.234530 2.809264 -1.641648 -1.406698 0.568643 0.965043 1.918379 -2.774855 ... 1.773841 -1.501573 -2.226702 4.776830 -6.559698 -0.805551 -0.276007 -3.858207 -0.537694 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4995 -5.120451 1.634804 1.251259 4.035944 3.291204 -2.932230 -1.328662 1.754066 -2.984586 1.248633 ... 9.979118 0.063438 0.217281 3.036388 2.109323 -0.557433 1.938718 0.512674 -2.694194 0
4996 -5.172498 1.171653 1.579105 1.219922 2.529627 -0.668648 -2.618321 -2.000545 0.633791 -0.578938 ... 4.423900 2.603811 -2.152170 0.917401 2.156586 0.466963 0.470120 2.196756 -2.376515 0
4997 -1.114136 -0.403576 -1.764875 -5.879475 3.571558 3.710802 -2.482952 -0.307614 -0.921945 -2.999141 ... 3.791778 7.481506 -10.061396 -0.387166 1.848509 1.818248 -1.245633 -1.260876 7.474682 0
4998 -1.703241 0.614650 6.220503 -0.104132 0.955916 -3.278706 -1.633855 -0.103936 1.388152 -1.065622 ... -4.100352 -5.949325 0.550372 -1.573640 6.823936 2.139307 -4.036164 3.436051 0.579249 0
4999 -0.603701 0.959550 -0.720995 8.229574 -1.815610 -2.275547 -2.574524 -1.041479 4.129645 -2.731288 ... 2.369776 -1.062408 0.790772 4.951955 -7.440825 -0.069506 -0.918083 -2.291154 -5.362891 0

5000 rows × 41 columns

In [145]:
df_test.isnull().sum()
Out[145]:
V1        5
V2        6
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
V29       0
V30       0
V31       0
V32       0
V33       0
V34       0
V35       0
V36       0
V37       0
V38       0
V39       0
V40       0
Target    0
dtype: int64
In [146]:
imputation_dict = {'V1': df_test.V1.median(), 'V2' : df_test.V2.mean()}
df_test.fillna(imputation_dict, inplace= True)
In [147]:
df_test.drop(highly_correlated_cols, axis = 1, inplace= True)
In [148]:
X = df_test.drop('Target', axis = 1)
y = df_test['Target']
In [149]:
model_performance_classification_sklearn(all_9_models_only, X, y)
Out[149]:
Accuracy Recall Precision F1
Model_0 0.9648 0.485816 0.815476 0.608889
Model_1 0.9790 0.645390 0.973262 0.776119
Model_2 0.9850 0.762411 0.964126 0.851485
Model_3 0.9790 0.645390 0.973262 0.776119
Model_4 0.9648 0.471631 0.831250 0.601810
Model_5 0.9904 0.843972 0.983471 0.908397
Model_6 0.9648 0.471631 0.831250 0.601810
Model_7 0.9412 0.875887 0.488142 0.626904
Model_8 0.9412 0.872340 0.488095 0.625954

Observations¶

  • The best model is find to be the Model_5 which is an SVC model made using oversampled data
  • The SVC Model gives an Overall performance of accuracy of 0.9904 Recall of 0.843972 precision of 0.983471 and f1 score of 0.908397.
  • But Here we can also use Model 7 and Model 8 which is generated using undersampled data and Random forest and bagging classifier.
  • Model 7 gives the Recall values of 87.58 and model 8 gives 87.23.

Pipelines to build the final model¶

In [ ]:
# to create pipeline and make_pipeline
from sklearn.pipeline import Pipeline, make_pipeline
In [157]:
#Selecting the best Model
all_9_models_only[5]
Out[157]:
SVC(C=10, degree=4, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
SVC(C=10, degree=4, random_state=1)
In [159]:
#Making imputer pipeline
numeric_transformer1 = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
numeric_transformer2 = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean"))])
In [100]:
def drop_cols (df, highly_correlated_cols):
    return df.drop(highly_correlated_cols, axis = 1).transform(df)
In [160]:
#Making imputer Column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num1", numeric_transformer1, ['V1']),
        ("num2", numeric_transformer2, ['V2']),
    ],remainder="passthrough",)
In [161]:
preprocessor
Out[161]:
ColumnTransformer(remainder='passthrough',
                  transformers=[('num1',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median'))]),
                                 ['V1']),
                                ('num2',
                                 Pipeline(steps=[('imputer', SimpleImputer())]),
                                 ['V2'])])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
ColumnTransformer(remainder='passthrough',
                  transformers=[('num1',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median'))]),
                                 ['V1']),
                                ('num2',
                                 Pipeline(steps=[('imputer', SimpleImputer())]),
                                 ['V2'])])
['V1']
SimpleImputer(strategy='median')
['V2']
SimpleImputer()
passthrough
In [164]:
# Creating new pipeline with best parameters
pipe = Pipeline(
    steps=[("pre", preprocessor),
           ("LGR",all_9_models_only[5])])
In [166]:
# Fit the model on training data
pipe.fit(X_train, y_train)
Out[166]:
Pipeline(steps=[('pre',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('num1',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median'))]),
                                                  ['V1']),
                                                 ('num2',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer())]),
                                                  ['V2'])])),
                ('LGR', SVC(C=10, degree=4, random_state=1))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('pre',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('num1',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median'))]),
                                                  ['V1']),
                                                 ('num2',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer())]),
                                                  ['V2'])])),
                ('LGR', SVC(C=10, degree=4, random_state=1))])
ColumnTransformer(remainder='passthrough',
                  transformers=[('num1',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median'))]),
                                 ['V1']),
                                ('num2',
                                 Pipeline(steps=[('imputer', SimpleImputer())]),
                                 ['V2'])])
['V1']
SimpleImputer(strategy='median')
['V2']
SimpleImputer()
['V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V17', 'V18', 'V19', 'V20', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V30', 'V31', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40']
passthrough
SVC(C=10, degree=4, random_state=1)
In [167]:
recall = recall_score(y, pipe.predict(X))  # to compute Recall
recall
Out[167]:
0.8439716312056738

Business Insights and Conclusions¶


  • The SVC Model with over sampling gives overall good performance compare with other models, it has hive accuracy, recall, precision and f1 score as 0.9904, 0.843972, 0.983471 and 0.908397.
  • But Considering only recall value, Random forest model with under sampling gives better result. It gives a recall value of 0.876 but precision and the f1 score is poor.
  • As an overall better model it is better to use SVC model than the Random forest